DSM010 Big Data - CW2

Introduction

Italy is facing the worst drought of the last 70 years. For weeks, the water level of the river Po, the country’s main river system, has been so low in some areas that old shipwrecks are resurfacing. - Nature [1]

France drought: Parched towns left short of drinking water. The French government has set up a crisis team to tackle a historic drought that has left more than 100 municipalities short of drinking water. BBC [2]

Europe battles water shortages as severe drought sweeps the continent. An unusually dry winter and spring followed by a summer of baking heat are part of an overall warming trend. Financial Times [3]

These are just a few of the headlines crowding newspapers and TV news these days. According to the 2021 Global Climate Report from the NOAA National Centers for Environmental Information, every month of 2021 was warmer than average, despite the cooling influence of the La Niña climate pattern in the tropical Pacific. The nine years from 2013 through 2021 rank among the 10 warmest years on record. [4]

Although global warming has not been uniform around the globe, it shows a general upward trend: Earth's temperature has risen by 0.14° Fahrenheit (0.08° Celsius) per decade since 1880, and the rate of warming since 1981 is more than twice that: 0.32° F (0.18° C) per decade. [4] The extra heat caused by climate change is driving regional and seasonal temperature extremes, reducing snow cover and sea ice, intensifying heavy rainfall, and changing habitat ranges for plants and animals. Along with these changes we are witnessing the worst droughts in years in various countries, not only in Europe but all around the world.

Since the drought crisis is so alarming, and something that will be at the center of most countries' agendas for years to come, we decided to investigate whether we can predict drought in a given area from meteorological and soil data.

The Data

Given the big data context of this course, I looked for a dataset that was large enough and that also contained useful and interesting information. The dataset was obtained from Predict Droughts using Weather & Soil Data on Kaggle, but the original source is the NASA Power Project website, which makes useful data available through its satellite technology.

The dataset comes already split into 3 parts, and is presented as follows:

The features can be summarized as follows:

Indicator Description Type
WS10M_MIN Minimum Wind Speed at 10 Meters (m/s) double
QV2M Specific Humidity at 2 Meters (g/kg) double
T2M_RANGE Temperature Range at 2 Meters (C) double
T2M Temperature at 2 Meters (C) double
WS50M_MIN Minimum Wind Speed at 50 Meters (m/s) double
T2M_MAX Maximum Temperature at 2 Meters (C) double
WS50M Wind Speed at 50 Meters (m/s) double
TS Earth Skin Temperature (C) double
WS50M_RANGE Wind Speed Range at 50 Meters (m/s) double
WS50M_MAX Maximum Wind Speed at 50 Meters (m/s) double
WS10M_MAX Maximum Wind Speed at 10 Meters (m/s) double
WS10M_RANGE Wind Speed Range at 10 Meters (m/s) double
PS Surface Pressure (kPa) double
T2MDEW Dew/Frost Point at 2 Meters (C) double
T2M_MIN Minimum Temperature at 2 Meters (C) double
T2MWET Wet Bulb Temperature at 2 Meters (C) double
PRECTOT Precipitation (mm/day) double
fips US county FIPS code int
date observation date string
score Drought level double

Our target is the drought level feature, which can be summarized as follows [4]:

Drought Levels

Also, later in the analysis, more data was added from The US Drought Monitor, making it possible to estimate the impact of the soil on the different drought levels. The dataset, with 3,019 entries and 31 features, is structured as follows:

Indicator Description
Latitude Latitude (degrees)
Longitude Longitude (degrees)
Elevation Median Elevation (meters)
slope1 0 % ≤ slope ≤ 0.5 %
slope2 0.5 % ≤ slope ≤ 2 %
slope3 2 % ≤ slope ≤ 5 %
slope4 5 % ≤ slope ≤ 10 %
slope5 10 % ≤ slope ≤ 15 %
slope6 15 % ≤ slope ≤ 30 %
slope7 30 % ≤ slope ≤ 45 %
slope8 Slope > 45 %
aspectN North: 0˚< aspect ≤45˚ or 315˚< aspect ≤360˚
aspectE East: 45˚ < aspect ≤ 135˚
aspectS South: 135˚ < aspect ≤ 225˚
aspectW West: 225˚ < aspect ≤ 315˚
aspectUnknown Undefined: Slope aspect undefined; this value is used for grids where slope gradient is undefined or slope
WAT_LAND mapped water bodies
NVG_LAND barren/very sparsely vegetated land
URB_LAND built-up land (residential and infrastructure)
GRS_LAND grass/scrub/woodland
FOR_LAND forest land, calibrated to FRA2000 land statistics
CULTRF_LAND rain-fed cultivated land
CULTIR_LAND irrigated cultivated land, according to GMIA 4.0
CULT_LAND total cultivated land
SQ1 Nutrient availability
SQ2 Nutrient retention capacity
SQ3 Rooting conditions
SQ4 Oxygen availability to roots
SQ5 Excess salts.
SQ6 Toxicity
SQ7 Workability (constraining field management)

Finally, the total amount of disk space needed for our datasets is ~3.5 GB.

Framing the Problem and Looking at the Big Picture

First of all, let's define what drought is. According to the National Drought Mitigation Center, there is no single definition of drought. When a drought begins and ends is difficult to determine: rainfall data alone won't tell you whether you are in a drought, how severe it may be, or how long it has lasted. We can, however, identify various indicators of drought, such as rainfall, snowpack, streamflow, and more, and track these indicators to monitor it. [5] Also, according to NASA, droughts are caused by low precipitation over an extended period of time; atmospheric conditions such as climate change, ocean temperatures, changes in the jet stream, and changes in the local landscape are all contributing factors. [6] We can add land and water temperatures, soil moisture levels [7] and deforestation [8] to the causes of drought. From these statements we can affirm that even if rain is an important factor, it needs to be contextualized and correlated with all the other factors and with time. Having data about ocean temperatures or jet streams could also be beneficial.

Furthermore, we need to frame our objective: we want to predict different drought levels given meteorological observations and soil specifications. Since our target is already labelled, the problem can be framed as a multiclass supervised learning problem. As we will see later, the data is highly imbalanced, so the classic performance measures (accuracy, precision, recall) won't be very informative without context. Instead we will mostly evaluate our best model on the weighted F1-score, which is better suited to imbalanced datasets, and on the weighted false positive rate.
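
To make the metric concrete, here is a minimal pure-Python sketch of how a weighted F1-score is computed: the per-class F1 scores are averaged using each class's support as the weight. (This illustrates the metric itself; the toy labels are invented, and the actual evaluation will use Spark's MulticlassClassificationEvaluator.)

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with class support (frequency) as weights."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for c in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        score += (support[c] / total) * f1
    return score

# A majority-class predictor reaches 80% accuracy on this invented target,
# but the weighted F1 exposes that class 1 is never predicted.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
wf1 = weighted_f1(y_true, y_pred)
```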

Regarding the models we will consider, we need to underline that not all of them are suitable for multiclass classification. Among the suitable ones we decided to use three: Decision Tree, Random Forest and One-vs-Rest with Logistic Regression.

The study will be structured as follows:

  1. Importing the Data
  2. Exploring the data:
    1. Downsampling for easier manipulation
    2. Univariate analysis of the features
    3. Bivariate analysis
    4. Adding more useful data
    5. Correlation
    6. Conclusions and actions to undertake
  3. Preparing the data:
    • Feature engineering
    • Removing features
    • Transforming the features
  4. Model Building
  5. Hyperparameter tuning
  6. Conclusions and improvements

Importing Libraries and Initialising Spark

Getting the Data

  1. Data is stored on: hdfs://lena-master/user/mgarg001/CW2/
  2. Check the type of data (e.g. time series)
  3. Source: https://www.kaggle.com/datasets/cdminix/us-drought-meteorological-data

Importing the Data

Explore the Data

  1. Create a copy of the dataframes
  2. Studying attributes
    1. Check missing values
    2. Check data noise
    3. Usefulness for the task
    4. Type of distribution
  3. Identify the target
  4. Visualize the data
  5. Study the correlation between the data
  6. Experiment with attribute combinations
  7. Is extra data useful?
  8. Document learnings

Creating a copy of the Dataframe

Before going ahead, since we are dealing with a large number of entries, it would be ideal to sample our dataset for data exploration. We will start from the bottom, assessing the structure of our target to understand whether a random sample is suitable or whether stratification is needed.

Target Study

The target feature of this dataset is the drought score, which goes from a level of 0 to a level of 4.

The target is divided into 5 classes, plus missing data and None values. As presented, our first approach will be to try to predict the different levels of drought.

It seems that the score data is highly imbalanced, with a high amount of NaN values. We will investigate this further later; for the moment we will stratify our data accordingly for exploration. Let's visualize the distribution of the target without NaN values.

The zero label (abnormally dry) constitutes more than 65% of the target once None values are excluded, while the highest level of drought constitutes only 0.8%. There are different approaches to tackling target imbalance:

  1. Downsampling/Upsampling/SMOTE (Synthetic Minority Oversampling Technique): these techniques alter the dataset so that all the classes have the same amount of entries. Given the large number of entries, downsampling could be a solution, but we could lose precious data.
  2. Create a new feature with the weight of each target class: even if not all models accept a "weight" hyperparameter for imbalanced datasets, the evaluation class for multiclass problems permits the use of a weight column.
  3. Group the minority classes (1, 2, 3, 4) into one unique label representing drought, while the zero class still represents the abnormally dry situation. This would turn our target into a 65-35% imbalance, which is much more manageable.
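
As a sketch of option 1, stratified downsampling can be expressed in a few lines of plain Python; in the actual Spark pipeline, `DataFrame.sampleBy` with per-class fractions plays the same role. The toy class counts below are invented for illustration:

```python
import random
from collections import Counter

def downsample(rows, label_key, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    random.seed(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    n_min = min(len(v) for v in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(random.sample(members, n_min))
    return balanced

# Hypothetical imbalanced label counts, mirroring the skew seen above
data = [{"score": 0}] * 650 + [{"score": 1}] * 250 + [{"score": 4}] * 8
balanced = downsample(data, "score")
counts = Counter(r["score"] for r in balanced)
```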

As we can see label stratification is preserved.

That's around 1.6% of our original data.

Data Exploration

Studying Attributes

Indicator Description Type Comment
WS10M_MIN Minimum Wind Speed at 10 Meters (m/s) double
QV2M Specific Humidity at 2 Meters (g/kg) double
T2M_RANGE Temperature Range at 2 Meters (C) double Check collinearity with T2M features
T2M Temperature at 2 Meters (C) double
WS50M_MIN Minimum Wind Speed at 50 Meters (m/s) double
T2M_MAX Maximum Temperature at 2 Meters (C) double
WS50M Wind Speed at 50 Meters (m/s) double
TS Earth Skin Temperature (C) double
WS50M_RANGE Wind Speed Range at 50 Meters (m/s) double Check collinearity with WS50M features
WS50M_MAX Maximum Wind Speed at 50 Meters (m/s) double
WS10M_MAX Maximum Wind Speed at 10 Meters (m/s) double
WS10M_RANGE Wind Speed Range at 10 Meters (m/s) double Check collinearity with WS10M features
PS Surface Pressure (kPa) double
T2MDEW Dew/Frost Point at 2 Meters (C) double Might be related to T2MWET
T2M_MIN Minimum Temperature at 2 Meters (C) double
T2MWET Wet Bulb Temperature at 2 Meters (C) double
PRECTOT Precipitation (mm/day) double Might have been useful to have mm/h, the standard measure for precipitation intensity
fips US county FIPS code int More info at: https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697
date observation date string to be changed to datetime
score Drought level double to be changed to int; categorical (ordinal) feature based on levels from 0 to 4

Checking missing values

None of the features has missing values apart from our target feature which, as said before, is missing ~85% of its values. Let's have a look at the original data to explain this:

As we can see the data is collected weekly (every Tuesday). Even if we are in the exploration phase, we can already think of two ways of solving this:

Check Data Noise, Distribution, Univariate Analysis

Let's evaluate now a summary of statistics of our dataset to make some considerations:

#SPARK IMPLEMENTATION
describe_df = train.describe().toPandas().T

describe_df.columns = describe_df.iloc[0] #setting up the columns
describe_df.drop(labels="summary",axis=0, inplace=True)

#adding quantiles to the summary
columns = ['fips',
 'PRECTOT',
 'PS',
 'QV2M',
 'T2M',
 'T2MDEW',
 'T2MWET',
 'T2M_MAX',
 'T2M_MIN',
 'T2M_RANGE',
 'TS',
 'WS10M',
 'WS10M_MAX',
 'WS10M_MIN',
 'WS10M_RANGE',
 'WS50M',
 'WS50M_MAX',
 'WS50M_MIN',
 'WS50M_RANGE',
 'score']
quantiles = train.approxQuantile(columns, [0.25,0.5,0.75], 0.05) # relative error of 0.05; 0 (exact) would be computationally expensive

#renaming columns
quantiles_df = pd.DataFrame(quantiles)
quantiles_df.index = columns
quantiles_df.columns = [0.25,0.5,0.75]

pd.concat([describe_df, quantiles_df], axis=1)

From this table alone we can already see 2 things:

Let's study each feature one by one:

FIPS

FIPS codes are the American county codes (https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697). Let's see how many we have:

#SPARK IMPLEMENTATION
train_df.select(F.col("fips")).distinct().count()

I think it would be interesting to have the corresponding longitude and latitude for each FIPS code. This could also help us visualize different features by zone. After exploring all the features, we will consider adding more data.

Date

Our data goes from:

Out of curiosity, we decided to see how the temperature is evolving through the last 20 years:

We can see a slight positive slope, clearly indicating that the tendency is towards a general temperature increase.

PRECTOT

PRECTOT quantifies the total amount of rain in mm/day. As we saw in the summary above, it looks quite skewed: 50% of the data has 0.018 mm/day of rain or less. Generally speaking, the convention for measuring rain is mm/h; that gives an idea of the intensity of the rainfall and could help distinguish a storm from light rain falling for many hours during the day.

From the summary we could infer that this feature is quite skewed. Let's confirm this:

#SPARK IMPLEMENTATION
train.select(F.skewness("PRECTOT"),F.kurtosis("PRECTOT")).show()

It is indeed quite skewed, with a very high kurtosis. To understand what could be considered an outlier, we can create boundaries using the interquartile range:

#SPARK IMPLEMENTATION

#Let's evaluate more quantiles
quantiles = train.approxQuantile(["PRECTOT"], [0.10,0.25,0.5,0.6,0.75, 0.85, 0.95], 0.05)
quantiles

def IQR(df, cols: list):
    """Compute the interquartile range and the fences that help flag outliers."""
    bounds = {}
    for col in cols:
        quantiles = df.approxQuantile(col, [0.25, 0.75], 0.05)
        iqr = quantiles[1] - quantiles[0]

        # Anything below the lower fence or above the upper fence is considered an outlier.
        bounds[col] = [
            quantiles[0] - 1.5 * iqr,  # lower fence
            quantiles[1] + 1.5 * iqr,  # upper fence
        ]

    return bounds

IQR(train, ["PRECTOT"])

It looks like the data is positively skewed, with a high kurtosis value (the higher it is, the longer the tail). The IQR fences also show that everything above ~5.45 mm/day of precipitation is an outlier. Let's visualize this:

To compare the distributions of subsets that differ substantially in size, I will use independent density normalization: (https://seaborn.pydata.org/generated/seaborn.histplot.html)

As we can see, there's a correlation between the presence of rain and the drought level. For rain levels > 0 there's no separation between drought levels. We could think of binarizing the feature, with 1 for rain and 0 for no rain.
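
Such a binarization is a one-liner; the sketch below is plain Python on invented values, while in the Spark pipeline the equivalent would be `pyspark.ml.feature.Binarizer` with `threshold=0.0`:

```python
def binarize(values, threshold=0.0):
    """Map each value to 1 if it exceeds the threshold, else 0 (rain / no-rain)."""
    return [1 if v > threshold else 0 for v in values]

# Hypothetical daily precipitation values in mm/day
daily_rain_mm = [0.0, 0.018, 5.4, 0.0, 12.3]
rain_flag = binarize(daily_rain_mm)
```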

#SPARK IMPLEMENTATION
columns = ["PRECTOT","score"]
PRECTOT = train.select(*[columns]).toPandas()
sns.boxplot(data=PRECTOT.PRECTOT, orient="h")

plt.tight_layout()
#SPARK IMPLEMENTATION
train.filter("PRECTOT > 3.95").count()

~15% of the total daily precipitation goes above 5.45 mm (75th percentile + 1.5 × IQR). We can see a certain continuity of outliers up until 100 mm/day. Outliers could be interpreted as floods or very heavy rains. There are various ways of dealing with outliers:

I think that going for the last one is the reasonable solution for all the skewed features, without altering our dataset too much.
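
Robust scaling centres each feature on its median and divides by its IQR, so extreme values can no longer dominate the scale. A minimal pure-Python sketch of the idea (the pipeline itself will use Spark ML's RobustScaler):

```python
def robust_scale(values):
    """(x - median) / IQR, the transformation a robust scaler applies."""
    s = sorted(values)
    n = len(s)

    def quantile(q):
        # Linear interpolation between the two nearest order statistics
        idx = q * (n - 1)
        lo = int(idx)
        hi = min(lo + 1, n - 1)
        frac = idx - lo
        return s[lo] * (1 - frac) + s[hi] * frac

    median = quantile(0.5)
    iqr = quantile(0.75) - quantile(0.25)
    return [(v - median) / iqr for v in values]

scaled = robust_scale(list(range(1, 10)))  # median 5, IQR 4
```

Unlike min-max or standard scaling, neither the centre nor the scale moves when a handful of extreme precipitation values is added.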

PS

This is the Surface Pressure (kPa). High-pressure systems reduce evaporation and moisture in the atmosphere so we will consider this feature as important. [11]

As we can see, the distribution of pressures is quite similar across drought levels. We can also see a spike of the maximum level of drought at about ~100 kPa. Although this plot is interesting, pressure is not a strong discriminant for drought, as all the curves sit one on top of the other.

As we can see, we have a high number of outliers below 89 kPa. Let's try to understand how many:

That's about 10% of the data; again, a considerable amount that we will address with a robust scaler.

TS

Earth Skin Temperature (C). Land Surface Temperature is an important variable within the Earth climate system. It describes processes such as the exchange of energy and water between the land surface and atmosphere, and influences the rate and timing of plant growth.[9]

Accurately understanding Land Surface Temperature at the global and regional level helps to evaluate land surface–atmosphere exchange processes in models and, when combined with other physical properties such as vegetation and soil moisture, provides a valuable metric of surface state. [10] This also suggests that more data about the soil and vegetation might be needed.

Here we can see how, at higher levels of TS, we get an increase in the density of the higher drought levels.

The median of TS is around 15 C. The data spread is between -20 and 40 C, with outliers in the lower part down to -30 C.

That's ~1% of the original data.

QV2M

Specific Humidity at 2 Meters (g/kg) represents the concentration of water vapor present in the air. [12]

The distribution is right-skewed, with a slight increase in the densities of drought levels between 10 and 15 g/kg.

T2MWET

The wet-bulb temperature is the lowest temperature that may be achieved by evaporative cooling of a water-wetted, ventilated surface. If a thermometer is wrapped in a water-moistened cloth, it will behave differently. The drier and less humid the air is, the faster the water will evaporate. The faster water evaporates, the lower the thermometer's temperature will be relative to air temperature.[13]

Here we can see a slightly higher density of the highest level of drought between 10 and 20 C.

Here as well we have very few outliers.

T2MDEW

The dew point is the temperature to which air must be cooled to become saturated with water vapor, assuming constant air pressure and water content. When cooled below the dew point, moisture capacity is reduced and airborne water vapor will condense to form liquid water known as dew. When this occurs via contact with a colder surface, dew will form on that surface. [14]

As for T2MWET, we can see the same pattern, where the highest drought level is denser between 10 and 20 C.

It looks like, for both T2MDEW and T2MWET, higher temperatures present a higher density of drought. I also expect both variables to be highly correlated.

Bivariate and Multivariate analysis

Let's go ahead with our analysis and have a look at how the features correlate with each other and whether we can find some interesting patterns:

T2M, T2M_RANGE, T2M_MIN, T2M_MAX

Temperature, range, min and max at 2 meters will be considered all together because they are part of the same "family".

As expected, there's a strong linear correlation between the temperature and its max and min. The distributions are slightly negatively skewed, apart from the temperature range, which has a normal distribution. Generally multicollinearity is not welcome, and highly correlated data should be taken care of. Again there is no strong separation between drought levels for each single feature, but we can see a pattern across all the bivariate scatter plots where the higher levels of drought sit in the upper right part of the grid, indicating a higher probability of drought at higher temperatures. It is important to underline that, due to the nature of the plot, we can't make precise claims because we don't know what's hiding behind the high-density points in the scatter plots.

WS50M, WS50M_MAX, WS50M_MIN, WS50M_RANGE

As before we treat the wind speed at 50m all together.

As for the temperatures, here we can see the positive correlation between the wind speed and the max wind speed. Compared to the temperature plots, the data has a different spread here, so dropping these features might not be the solution. From the distributions we can also see how a feature transformation could help bring them closer to normal (beneficial for linear models); in this case applying a log transformation could be advisable. Furthermore, there might be a pattern where the highest levels of drought are associated with lower wind speeds.
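
To make the log-transformation idea concrete, here is a small pure-Python check on made-up, right-skewed wind-speed-like values: `log1p` pulls in the long right tail and brings the sample skewness down.

```python
import math

def skewness(xs):
    """Sample skewness (Fisher-Pearson moment coefficient, uncorrected)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Invented right-skewed sample: many calm observations, a few extreme ones
wind = [1, 1, 2, 2, 2, 3, 3, 4, 6, 9, 15, 25]
raw_skew = skewness(wind)
log_skew = skewness([math.log1p(x) for x in wind])
```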

WS10M, WS10M_MAX, WS10M_MIN, WS10M_RANGE

These features are the same as the WS50M ones, except that the height is now 10 m.

Here we can make the same observations as for WS50M. Again we see a strong linear correlation between WS10M_MAX and WS10M, but with a non-negligible spread of the data.

PRECTOT vs PS

We think it would be interesting to show the correlation between pressure, wind speeds and precipitation.

A bivariate scatter plot is not well suited to skewed data, because we can't really see what's behind the high-density points in the lower right part of the plot. We will use a kernel density plot instead.

From this plot we can't really see a strong pattern indicating how a PS-PRECTOT variation could influence drought. As we would expect, the drought density is highest at higher pressures and low precipitation.

PRECTOT vs T2M

Let's now analyze precipitation vs temperature in the same way:

As we can see, the higher the temperature and the lower the precipitation, the higher the drought level.

T2MWET vs T2MDEW

Let's visualize now the relation between the dew point and the wet bulb temperature, this time compared with the specific humidity.

The key difference between dewpoint and wet bulb temperature is that dewpoint temperature is the temperature to which we should cool the air to saturate the air with water vapor whereas wet bulb temperature is the temperature that we get from a moistened thermometer bulb that is exposed to air flow. Dew point and wet bulb temperatures are very important in indicating the state of the humid air.[15]

As expected, we have a strong linear correlation between T2MWET and T2MDEW.

We can see a linear correlation between the temperatures and a logarithmic correlation with QV2M.

Is extra data useful?

Before going ahead, we might ask ourselves whether extra data would be useful. As we said before, soil data could be helpful, as would knowing the amount of vegetation. Therefore we will add the Harmonized World Soil Database to our dataset.


It seems that we have some interesting features. My attention went towards these:

Let's have a look at the data:

Let's join our data exploration datasets for a more indepth study:

Let's visualize now the droughts levels on a map using latitude and longitude.

From this plot we can see that:

Let's have a look now at where most of the PRECTOT outliers took place:

It looks like the south east is hit the hardest. This might also be due to the hurricane season. Let's check by plotting the winds.

There is some correspondence on the south east coast, but it seems that the highest winds are up north. Please note that yellow dots indicate speeds from 50 to 90 km/h.

Correlation

Let's have a look now at the correlation between features:

As we can see here, there are some strong correlations:

Given this information, we could think about dropping these features to avoid multicollinearity. One alternative is to use PCA before fitting any model and let the algorithm "choose" the most important components.

Looking at the heatmap, we can also see that there is no strong linear correlation between the features and the drought levels.
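
The redundancy argument can be illustrated with a tiny synthetic example (numbers invented): a variable that tracks another closely, as TS tracks T2M, shows a near-1 correlation coefficient, while an unrelated one stays near 0. In the Spark pipeline the same matrix would come from `pyspark.ml.stat.Correlation.corr` on the assembled feature vector.

```python
import numpy as np

rng = np.random.default_rng(0)
t2m = rng.normal(15, 8, 500)        # stand-in for air temperature at 2 m
ts = t2m + rng.normal(0, 1, 500)    # skin temperature tracking T2M closely
ws = rng.normal(5, 2, 500)          # wind speed, independent by construction

corr = np.corrcoef(np.vstack([t2m, ts, ws]))  # 3x3 correlation matrix
```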

Data Exploration Conclusions

After the data exploration we can say that:

  1. One of the first steps to perform in a big data environment is to filter our dataset, to avoid using too many resources on our cluster. We also follow Occam's razor: with no significant difference in performance, simpler models should be preferred. Therefore we will get rid of the following features:
    • There is a strong correlation between T2M, T2M_MAX, T2M_MIN and TS. We will keep only T2M as it is representative of the temperature features.
    • Same with WS10M and WS50M: we will drop WS10M_MAX, WS50M and WS50M_MAX.
    • lat/lon are more correlated with the score than fips.
    • QV2M, T2MDEW and T2MWET are all faces of the same coin and highly correlated with the T2M features; we can drop them.
    • elevation and PS are redundant.
    • The SQ features are highly correlated. We can drop SQ2 and SQ7.
    • CULT_LAND and CULTRF_LAND are redundant.
    • slope2 and aspectUnknown are redundant.
    • Use PCA to reduce the number of features in the case of a linear model.
  2. Feature skewness:
    • Linear models might need normally distributed data, which implies feature transformation.
    • Another option could be bucketizing our features.
    • A robust scaler might be needed before training any model. We will choose this option for simplicity.
  3. Feature engineering:
    • Create week-of-year and year features and remove date: this will help in grouping our data.
    • Since our target is collected on a weekly basis, we might group the data accordingly. This will also help in dealing with all the missing target entries.
  4. Possible models:
    • Decision Tree
    • Random Forest
    • OneVsRest
  5. Imbalanced Target:
    • We are dealing with highly imbalanced data. Even after grouping we'll still have a target imbalanced towards 0s.
    • Possible solutions are undersampling/oversampling/SMOTE or a weight feature. We will use the latter.

Prepare the Data

To speed up our analysis process, let's build functions that we can easily use in the pipeline on our train/test/val datasets.

Now let's group our data on a weekly basis:
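
The grouping key is (fips, year, week-of-year). A pure-Python sketch of the idea on three invented rows (in the Spark pipeline this is a `groupBy` over `year`/`weekofyear` of the date column, averaging the numeric features and keeping the single weekly label):

```python
from collections import defaultdict
from datetime import date

# Invented sample: one county, three consecutive days, label only on Tuesday
rows = [
    {"fips": 1001, "date": date(2016, 1, 4), "T2M": 3.1, "score": None},
    {"fips": 1001, "date": date(2016, 1, 5), "T2M": 2.7, "score": 1.0},
    {"fips": 1001, "date": date(2016, 1, 6), "T2M": 4.0, "score": None},
]

groups = defaultdict(list)
for r in rows:
    iso_year, iso_week, _ = r["date"].isocalendar()
    groups[(r["fips"], iso_year, iso_week)].append(r)

weekly = [
    {
        "fips": fips, "year": year, "week": week,
        "T2M": sum(m["T2M"] for m in members) / len(members),  # weekly mean
        # keep the single weekly label (collected on Tuesdays)
        "score": next(m["score"] for m in members if m["score"] is not None),
    }
    for (fips, year, week), members in groups.items()
]
```

After this step every row has a label, which is exactly how the weekly grouping removes the ~85% of missing targets.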

As we can see, by grouping our data on a weekly basis we reduced our dataset to 15% of its original size.

Now we will join the soil dataset with our grouped data:

For analysis purposes, we are going to create a new feature, "weights", to account for the imbalanced data. Let's see our label distribution at this point:

Let's create now a weight feature for our labels:
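
One common convention (an assumption here, not the only choice) is inverse-frequency weighting, w_c = n_total / (n_classes × n_c), so that every class contributes the same total weight. A plain-Python sketch:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_total / (n_classes * n_class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Invented label distribution, skewed towards 0 like ours
labels = [0] * 80 + [1] * 15 + [2] * 5
weights = class_weights(labels)
# Each row would then carry weights[label] in its "weights" column.
```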

To proceed with our study, we need to vectorize our features and then apply the desired transformations. To achieve this we will use another pipeline:

Shortlisting Promising Models

Only a few models support multiclass classification; here we will use Decision Tree, Random Forest and One-vs-Rest with Logistic Regression.

Decision Tree Classifier

Random Forest

One vs Rest with Logistic Regression

Reduction of Multiclass Classification to Binary Classification. Performs reduction using one against all strategy. For a multiclass classification with k classes, train k models (one per class). Each example is scored against all k models and the model with highest score is picked to label the example. [16]
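
The reduction can be sketched in a few lines of numpy: one plain gradient-descent logistic regression per class, each trained on a binarized copy of the labels, and the class whose model scores highest wins. The toy 1-D data is invented; in the pipeline, Spark's `OneVsRest` wraps the same idea around `LogisticRegression`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, lr=0.5, epochs=300):
    """Logistic regression via plain gradient descent (bias folded into X)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        w -= lr * grad
    return w

def ovr_fit_predict(X, y, X_new):
    """One-vs-Rest: k binary models, predict the class with the highest score."""
    classes = np.unique(y)
    W = np.array([train_binary(X, (y == c).astype(float)) for c in classes])
    Xb = np.hstack([X_new, np.ones((len(X_new), 1))])
    scores = sigmoid(Xb @ W.T)              # shape (n_samples, n_classes)
    return classes[np.argmax(scores, axis=1)]

# Three well-separated 1-D clusters as a sanity check
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(m, 0.3, 30) for m in (-3.0, 0.0, 3.0)]).reshape(-1, 1)
y = np.repeat([0, 1, 2], 30)
pred = ovr_fit_predict(X, y, X)
```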

For the One-vs-Rest model we will use a different strategy. Since it is computationally very expensive, we will downsample our data so that all the classes have the same amount of entries. Also, as we are going to use Logistic Regression, we will use PCA for feature reduction to speed up the evaluation, since this model favors a lower-dimensional dataset.

We didn't do this for the tree models because they would have suffered more from the reduction in the amount of data.

As we can see, 3 components explain more than 90% of the variance of the data.

Since we balanced out our dataset, we don't need the weights column here.

Validation

Please note: the following fine-tuning code is commented out due to the excessive time needed to run it. It improved the validation score by ~4%, confirming that there is still some margin for improvement.

Fine Tune the System

So far our best model is Random Forest. Let's perform hyperparameter tuning:

# let's reduce the dataframe
columns = ["score","weights", "scaled_features_robust"]
train_vectorized_red = train_vectorized.select(*columns).alias("train_vectorized_red")
# train = train_vectorized_red.repartition(2) # optimize in-memory partitioning
train_vectorized_red.cache()
train_vectorized_red.show(5)

# building the grid search parameters
grid = (ParamGridBuilder()
                .addGrid(rf.maxDepth, [15, 20])
                .addGrid(rf.numTrees, [20, 50])
             .build())

#evaluator
evaluator = MulticlassClassificationEvaluator(
    predictionCol="prediction",
    labelCol="score",
    metricName="f1"
)

In this case, instead of Cross validation, we will use Train Test split. TrainValidationSplit only evaluates each combination of parameters once, as opposed to k times in the case of CrossValidator. It is, therefore, less expensive, but will not produce as reliable results when the training dataset is not sufficiently large. [17]

#TrainValidationSplit
tvs = TrainValidationSplit(estimator=rf,
                           estimatorParamMaps=grid,
                           evaluator=evaluator,
                           # 85% of the data will be used for training, 15% for validation.
                           trainRatio=0.85,
                           parallelism=2, #threadpool amount
                           seed=0
                          )

# # cross validation
# cv = CrossValidator(
#     estimator=rf,
#     estimatorParamMaps=grid,
#     evaluator=evaluator,
#     parallelism=6,
#     seed=0
# )

#fitting the model
cvModel = tvs.fit(train_vectorized_red)

#evaluating predictions
cv_predictions = cvModel.transform(test_vectorized)

get_metrics(cv_predictions)

#checking the best parameters 
cvModel.bestModel

#NB only valid for cross validation
# cv_predictions = [
#  (
#  [
#  {key.name: paramValue} 
#  for key, paramValue 
#  in zip(
#  params.keys(), 
#  params.values())
#  ], metric
#  ) 
#  for params, metric 
#  in zip(
#  cvModel.getEstimatorParamMaps(), 
#  cvModel.avgMetrics
#  )
# ]

# sorted(cv_predictions, 
#  key=lambda el: el[1], 
#  reverse=True)[0]

print("Best Depth: ", cvModel.bestModel._java_obj.getMaxDepth())
print("Best numTrees: ", cvModel.bestModel._java_obj.getNumTrees())

Validation

cv_validation = cvModel.transform(val_vectorized)

get_metrics(cv_validation)

Conclusions

We started by analysing our dataset feature by feature. We noticed that some key predictors, such as total precipitation and surface pressure, were highly skewed. We decided to assess the skewness of the data but not to remove outliers directly, implementing instead a robust scaler in our transformation pipeline. Furthermore, we did not seek to normalise the distribution of our features, something that is considered beneficial especially for linear models, because that would have excessively influenced the interpretation of our data. We then removed highly correlated features that could have caused multicollinearity problems and added new features describing the soil status, exploring how they were correlated with the various drought levels. It is important to underline that none of the features was strongly linearly correlated with our target, although this does not rule out nonlinear relationships. Moreover, during our exploration we noticed that our data was dominated by NaN values and discovered that the reason was that the target was collected on a weekly basis. We therefore grouped our entries by year and week for each county code; although this removed all null values, it strongly reduced the size of the dataset, and the target remained imbalanced towards the dry-soil label. To address this problem we followed two different strategies, depending on the model we were dealing with.
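As a reminder of the scaling step mentioned above, robust scaling centres each feature on its median and divides by the interquartile range. A minimal pure-Python sketch of the idea (Spark's `RobustScaler` applies the same transformation per feature; the sample values are illustrative):

```python
import statistics

def robust_scale(values):
    """Centre by the median and scale by the interquartile range (IQR);
    pyspark.ml.feature.RobustScaler applies the same idea per feature."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    return [(v - med) / iqr for v in values]

# the extreme value is scaled down without being removed
print(robust_scale([1, 2, 3, 4, 100]))
```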

In the end, the best performing model was Random Forest, and we performed hyperparameter optimisation through a train/validation split, taking into account the trade-off between tree depth and the number of trees. Please note that only a few parameter values were chosen due to the long processing time; ideally, three values per hyperparameter are advisable.
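With three values per hyperparameter, as suggested, the grid grows multiplicatively. A quick sketch in plain Python (the candidate values are illustrative, not the ones we tuned):

```python
from itertools import product

# illustrative candidate values, three per hyperparameter
max_depth_values = [10, 15, 20]
num_trees_values = [20, 50, 100]

# every (maxDepth, numTrees) combination the tuner would have to evaluate
grid = list(product(max_depth_values, num_trees_values))
print(len(grid))  # 9 combinations, versus 4 in the 2x2 grid used above
```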

After the validation step, the F1-score is ~57%, with a weighted false-positive rate of ~36%. All in all, we consider the result of our analysis positive, since the average prediction probability is better than a coin toss. Nonetheless, it is important to note that most of the models performed poorly on every class except class 0. One possible solution would be to group all classes > 0 into a single class 1, reducing the task to a binary classification problem. Other areas for improvement involve different models (e.g. XGBoost) or more aggressive feature transformations, such as capping the features with the most outliers or normalising them with transformations like log or sqrt. These are the main improvements we would pursue in future work.
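The proposed class grouping amounts to a simple relabelling. In PySpark this would be a `when`/`otherwise` column expression; the function below is a plain-Python illustration:

```python
def binarize_score(score: int) -> int:
    """Collapse the multi-level drought score into a binary label:
    0 stays 'no drought', every level above 0 becomes 'drought' (1)."""
    return 0 if score == 0 else 1

# the original six drought levels reduce to two classes
print([binarize_score(s) for s in range(6)])  # → [0, 1, 1, 1, 1, 1]
```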

References

[1] STELLA LEVANTESI, 2022, 14 July. Italy must prepare for a future of chronic drought. Nature Italy.

[2] BBC News, 2022. France drought: Parched towns left short of drinking water. [accessed 05/08/2022].

[3] FT REPORTERS, 2022, 5 August. Europe battles water shortages as severe drought sweeps the continent. Financial Times.

[4] U.S. Drought Monitor. Available: https://droughtmonitor.unl.edu/.

[5] Measuring Drought. Available: https://drought.unl.edu/ranchplan/DroughtBasics/WeatherandDrought/MeasuringDrought.aspx [accessed 07/08/2022].

[6] What is a drought and what causes it? Available: https://gpm.nasa.gov/resources/faq/what-drought-and-what-causes-it#:~:text=Droughts%20are%20caused%20by%20low,factors%20that%20contribute%20to%20drought [accessed 07/08/2022].

[7] RICHARD RESTUCCIA. 5 Causes of Drought. Available: https://jainsusa.com/blog/drought/ [accessed 08/08/2022].

[8] RINKESH. Causes, Effects and Solutions to Drought. Available: https://www.conserve-energy-future.com/causes-effects-solutions-drought.php [accessed 08/08/2022].

[9] WIKIPEDIA. Skin temperature (of an atmosphere). Available: https://en.wikipedia.org/wiki/Skin_temperature_(of_an_atmosphere)#:~:text=The%20skin%20temperature%20is%20a,tropopause%20temperature%20of%20209%20K. [accessed 08/08/2022].

[10] ESA. Land Surface Temperature. Available: https://climate.esa.int/en/projects/land-surface-temperature/ [accessed 08/08/2022].

[11] BBC. Droughts. Available: https://www.bbc.co.uk/bitesize/guides/ztvvmnb/revision/1 [accessed 08/08/2022].

[12] WIKIPEDIA. Humidity. Available: https://en.wikipedia.org/wiki/Humidity [accessed 08/08/2022].

[13] WIKIPEDIA. Dew point. Available: https://en.wikipedia.org/wiki/Dew_point [accessed 08/08/2022].

[14] WIKIPEDIA. Wet-bulb temperature. Available: https://en.wikipedia.org/wiki/Wet-bulb_temperature [accessed 08/08/2022].

[15] MADHU. Difference Between Dewpoint and Wet Bulb Temperature. Available: https://www.differencebetween.com/difference-between-dewpoint-and-wet-bulb-temperature/#:~:text=The%20key%20difference%20between%20dewpoint,is%20exposed%20to%20air%20flow [accessed 08/08/2022].

[16] WENQIANG FENG, 2021. Learning Apache Spark with Python.

[17] Apache Spark, 2022. ML Tuning: model selection and hyperparameter tuning. Available: https://spark.apache.org/docs/latest/ml-tuning.html#train-validation-split

[18] REBECCA LINDSEY and LUANN DAHLMAN, 2022, 28 June. Climate Change: Global Temperature. Available: https://www.climate.gov/news-features/understanding-climate/climate-change-global-temperature [accessed 08/08/2022].